Method for efficient data transformation

ABSTRACT

A digraph including a plurality of ordinary nodes, at least one of a composition node and a decomposition node, and a plurality of arcs interconnecting any of said nodes.

FIELD OF THE INVENTION

[0001] The present invention relates to data transformation.

BACKGROUND OF THE INVENTION

[0002] In today's business environment, many applications and solutions need to use data that are expressed in different formats and languages. Effective use of data often requires that data be transformed from one data format into another.

[0003] For example, healthcare providers, such as physicians, create large volumes of patient information at healthcare facilities, such as hospitals, clinics, laboratories, and medical offices. Often, a patient may be treated by more than one healthcare provider, necessitating that the patient's records at one healthcare provider be readily available to other healthcare providers, as this information might be critical to the healthcare provider when treating the patient. Unfortunately, the wide variety of formats in which information is stored might impede the healthcare provider's ability to assimilate the information. Although medical data may be converted from one format to another to facilitate data interchange and thereby potentially improve patient care, doing so efficiently and at a minimum cost is vital in light of spiraling medical costs.

[0004] In many cases, transformation of data from a source format into a target format is carried out as a series of transformations to one or more intermediate data formats. While transformations might also be required that unify multiple data formats (i.e., many-to-one cardinality), that result in several target formats (i.e., one-to-many cardinality), or both (i.e., many-to-many cardinality), techniques for determining the most efficient paths for transformations of various cardinalities do not currently exist.

SUMMARY OF THE INVENTION

[0005] The present invention discloses a method for efficient data transformation, particularly where multiple transformation paths are available and where transformations may be one-to-one, one-to-many, many-to-one, or many-to-many. The present invention facilitates efficient data transformation and interchange in fields such as, but not limited to, medical records management, multimedia production, and business data warehousing.

[0006] In the present invention, a table of data transformations and their associated costs is expressed in a single digraph, where source and target data formats are represented as nodes connected by cost-labeled arcs. Each arc connecting the nodes has a non-negative cost, where the cost may be expressed in terms of transformation execution time, labor costs, or any other costs. The most efficient transformation paths from the sources to the targets are then determined as those transformation paths that incur the lowest accumulated cost.

[0007] In one aspect of the present invention a digraph is provided including a plurality of ordinary nodes, at least one of a composition node and a decomposition node, and a plurality of arcs interconnecting any of the nodes.

[0008] In another aspect of the present invention the composition node is connected to at least two of the nodes via arcs incoming to the composition node and to one other of the nodes via an arc outgoing to the other node.

[0009] In another aspect of the present invention the decomposition node is connected to one of the nodes via an arc incoming to the decomposition node and to at least two other of the nodes via arcs outgoing to the other nodes.

[0010] In another aspect of the present invention the ordinary nodes represent data formats.

[0011] In another aspect of the present invention a first one of the ordinary nodes connected via an outgoing one of the arcs to a second one of the ordinary nodes represents a transformation of one data format into another.

[0012] In another aspect of the present invention any of the arcs has an associated non-negative cost.

[0013] In another aspect of the present invention a method is provided for constructing a digraph from a plurality of source-to-target traversals, the method including representing the sources and targets as a plurality of ordinary nodes, representing any of the traversals having a one-to-one cardinality by connecting the source node of the traversal to the target node of the traversal by an arc outgoing from the source node, and performing any of the following: representing any of the traversals having a many-to-one cardinality by connecting the source nodes of the traversal to a composition node by arcs outgoing from the source nodes, and by connecting the composition node to the target node of the traversal by an arc outgoing from the composition node; representing any of the traversals having a one-to-many cardinality by connecting the source node of the traversal to a decomposition node by an arc outgoing from the source node, and by connecting the decomposition node to the target nodes of the traversal by arcs outgoing from the decomposition node; and representing any of the traversals having a many-to-many cardinality by connecting the source nodes of the traversal to a composition node by arcs outgoing from the source nodes, by connecting the composition node to a decomposition node by an arc outgoing from the composition node, and by connecting the decomposition node to the target nodes of the traversal by arcs outgoing from the decomposition node.

[0014] In another aspect of the present invention the method further includes associating a non-negative cost with any of the arcs.

[0015] In another aspect of the present invention a method is provided of efficient path discovery in a digraph including a plurality of ordinary nodes, at least one of a composition node and a decomposition node, and a plurality of arcs interconnecting any of the nodes, the method including providing a source node s connected to a set S of source nodes in the digraph via outgoing arcs of zero cost, initializing to zero a cumulative cost of the path to s, providing a composition node t′ connected to a set T of target nodes in the digraph via incoming arcs of zero cost, providing a target node t connected to composition node t′ via an incoming arc of zero cost, defining a set W of nodes in the digraph initially including only node s, defining a set V of all nodes in the digraph, determining the cumulative costs of the paths to all nodes y in V that are connected to node s by an arc, while W<>V, selecting a node x in V from all nodes in V that are not in W whose cumulative cost is minimal, adding node x to W, and for each node y in V to which x has an outgoing arc: if y is not a composition node, determining the cumulative cost of the path to y as the lesser of a) the current known cumulative cost of the path to y, and b) the cumulative cost of the path to x plus the cost of the arc connecting x to y; if y is a composition node, and all nodes that have outgoing arcs to y are in W, determining the cumulative cost of the path to y as the lesser of a) the current known cumulative cost of the path to y, and b) the sum of the cumulative costs of the paths to all nodes that have outgoing arcs to y.

[0016] In another aspect of the present invention the method further includes determining the most efficient path from node s to a destination node selected from any of the nodes as including the arcs whose cost was added to the final cumulative cost of the destination node.

[0017] In another aspect of the present invention the method further includes determining the most efficient path from S to T as including the arcs whose cost was added to the final cumulative costs of the nodes of T.

[0018] In another aspect of the present invention the step of determining the most efficient path includes a) traversing each incoming arc of each node in T whose cost was added to the final cumulative cost of each current node, to arrive at one or more next nodes in the path, b) traversing each incoming arc of each node arrived at in the previous step whose cost was added to the final cumulative cost of each node arrived at in the previous step, to arrive at one or more next nodes in the path, and c) repeating step b) until the currently-arrived-at nodes are the nodes of S, where the traversed arcs together form the most efficient path from S to T.

[0019] In another aspect of the present invention a computer program is provided embodied on a computer-readable medium, the computer program including a first code segment operative to provide a source node s connected to a set S of source nodes in a digraph via outgoing arcs of zero cost, the digraph including a plurality of ordinary nodes, at least one of a composition node and a decomposition node, and a plurality of arcs interconnecting any of the nodes, a second code segment operative to initialize to zero a cumulative cost of the path to s, a third code segment operative to provide a composition node t′ connected to a set T of target nodes in the digraph via incoming arcs of zero cost, a fourth code segment operative to provide a target node t connected to composition node t′ via an incoming arc of zero cost, a fifth code segment operative to define a set W of nodes in the digraph initially including only node s, a sixth code segment operative to define a set V of all nodes in the digraph, a seventh code segment operative to determine the cumulative costs of the paths to all nodes y in V that are connected to node s by an arc, an eighth code segment operative, while W<>V, to select a node x in V from all nodes in V that are not in W whose cumulative cost is minimal, add node x to W, and for each node y in V to which x has an outgoing arc: if y is not a composition node, determine the cumulative cost of the path to y as the lesser of a) the current known cumulative cost of the path to y, and b) the cumulative cost of the path to x plus the cost of the arc connecting x to y; if y is a composition node, and all nodes that have outgoing arcs to y are in W, determine the cumulative cost of the path to y as the lesser of a) the current known cumulative cost of the path to y, and b) the sum of the cumulative costs of the paths to all nodes that have outgoing arcs to y.

BRIEF DESCRIPTION OF THE DRAWINGS

[0020] The present invention will be understood and appreciated more fully from the following detailed description taken in conjunction with the appended drawings in which:

[0021] FIGS. 1A, 1B, 1C, and 1D are simplified illustrations of digraph elements, constructed and operative in accordance with a preferred embodiment of the present invention;

[0022] FIG. 2 is a simplified illustration of a digraph, constructed and operative in accordance with a preferred embodiment of the present invention;

[0023] FIGS. 3A and 3B, taken together, are a simplified flowchart illustration of a method of efficient path discovery in a digraph, operative in accordance with a preferred embodiment of the present invention; and

[0024] FIGS. 4A, 4B, 4C, and 4D are simplified illustrations of the digraph of FIG. 2 reflecting the application of the method of FIGS. 3A and 3B.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

[0025] Reference is now made to FIGS. 1A, 1B, 1C, and 1D, which are simplified illustrations of digraph elements, constructed and operative in accordance with a preferred embodiment of the present invention. In accordance with the present invention, a digraph is constructed to represent one or more source-to-target traversals, such as a transformation of data from a source format into a target format, and their attendant costs. If a transformation t has a cost=1, a cardinality of one-to-one, and transforms data from format {a1} to format {b1}, the sub-graph of FIG. 1A is created, where {a1} and {b1} are represented by nodes 10 interconnected by an arc 12 whose direction indicates the logical direction of the transformation. If t has a cost=1, a cardinality of one-to-many, and transforms data from format {a1} to formats {b1,b2}, the sub-graph of FIG. 1B is created. The one-to-many cardinality is represented by a square-shaped decomposition node 14. If t has a cost=1, a cardinality of many-to-one, and transforms data from formats {a1,a2} to format {b1}, the sub-graph of FIG. 1C is created. The many-to-one cardinality is represented by a plus-shaped composition node 16. The composition node of the present invention requires that transformation t may be executed only after all data in source formats {a1,a2} exist. If t has a cost=1, a cardinality of many-to-many, and transforms data from formats {a1,a2,a3} to formats {b1,b2}, the sub-graph of FIG. 1D is created, combining the sub-graphs of FIGS. 1B and 1C. For the sake of clarity, a node that is neither a composition node nor a decomposition node is herein referred to as an ordinary node.

[0026] Reference is now made to FIG. 2, which is a simplified illustration of a digraph, constructed and operative in accordance with a preferred embodiment of the present invention. The digraph of FIG. 2 is an exemplary construction using the sub-graphs of FIGS. 1A-1D which models a number of transformations, shown in Table A, from source data formats, such as flat file and database (db), into target data formats, such as rtfXml and dbXml respectively. Each transformation is associated with a cost, which may represent any non-negative relevant cost, such as processing time, memory used, number of computer operations, etc.

TABLE A

Sources          Targets          Cost
flatFile         rtfXml           1
db               dbXml            3
db               123              1
wordpro          html             2
123              html             0.5
123              dbXml            1
rtfXml, dbXml    performanceML    1
html, dbXml      performanceML    1
rtfXml, dbXml    XHTML            2
html             text, gif        2

[0027] It will be seen in Table A that some transformations may require multiple sources, such as the transformation of rtfXml and dbXml into performanceML, and that some split a single source into multiple target formats, such as the transformation of html into text and gif.

[0028] For both diagrammatic clarity and generality of description, each source and target of Table A may be expressed as numbered nodes, such as is shown in Table B as follows:

TABLE B

Source Nodes     Target Nodes     Cost
102              116              1
104              118              3
104              120              1
106              122              2
120              122              0.5
120              118              1
116, 118         110              1
122, 118         110              1
116, 118         108              2
122              112, 114         2

[0029] The digraph of FIG. 2 is constructed to represent the collection of source nodes and target nodes, where each transformation is depicted, using the sub-graphs of FIGS. 1A-1D, as a set of arcs and nodes, where each arc is shown together with its associated cost.

[0030] Source node 104 is shown connected both to node 118 and to node 120. This type of digraph notation may be used to represent, for example, that a single source data format may be transformed into either of two different data formats, represented by node 118 and node 120 respectively. The arc from node 120 to node 118 may also be used to show that the source data format represented by source node 104 may be transformed into the data format represented by node 118, first via transformation into the data format represented by node 120, and then via transformation into the data format represented by node 118.
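The two routes from node 104 to node 118 can be compared with a few lines of Python. This is only an illustrative sketch: the dictionary `costs` and the helper `route_cost` are hypothetical names introduced here, and the arc costs are the ones listed in Table B.

```python
# Arc costs taken from Table B; keys are (source node, target node).
costs = {(104, 118): 3, (104, 120): 1, (120, 118): 1}

def route_cost(route):
    """Sum the arc costs along a route given as a list of node numbers."""
    return sum(costs[(a, b)] for a, b in zip(route, route[1:]))

direct = route_cost([104, 118])         # single transformation
indirect = route_cost([104, 120, 118])  # via the intermediate format
cheaper = min(direct, indirect)
```

With these costs the two-step route through node 120 accumulates a cost of 2, which is lower than the cost of 3 for the direct arc.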

[0031] Many-to-one cardinality, such as where data in two or more data formats are to be combined into a single data format, is shown represented by plus-shaped composition nodes 124, 126, and 128, whose incoming arcs are assigned a zero cost. One-to-many cardinality, such as where one data format is to be split into or otherwise transformed into two or more different data formats, is shown represented by a square-shaped decomposition node 130. The outgoing arcs from decomposition nodes are also assigned a zero cost.
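The construction described above might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function name `build_digraph`, the generated labels such as `comp6` and `decomp9`, and the dictionary-of-arcs representation are all hypothetical. Following the zero-cost convention above, the cost of a transformation is placed on the one arc that is neither incoming to a composition node nor outgoing from a decomposition node. The sketch also assumes at most one arc per ordered pair of nodes.

```python
def build_digraph(rows):
    """Build arcs from (sources, targets, cost) rows.

    Returns (arcs, composition, decomposition), where arcs maps
    (from_node, to_node) to a non-negative cost.
    """
    arcs = {}
    composition, decomposition = set(), set()
    for i, (sources, targets, cost) in enumerate(rows):
        tail = sources[0]
        if len(sources) > 1:            # many-to-one side: composition node
            tail = f"comp{i}"           # hypothetical generated label
            composition.add(tail)
            for src in sources:
                arcs[(src, tail)] = 0.0  # incoming arcs carry zero cost
        head = targets[0]
        if len(targets) > 1:            # one-to-many side: decomposition node
            head = f"decomp{i}"         # hypothetical generated label
            decomposition.add(head)
            for tgt in targets:
                arcs[(head, tgt)] = 0.0  # outgoing arcs carry zero cost
        arcs[(tail, head)] = cost       # the transformation's own cost
    return arcs, composition, decomposition

# Rows of Table B.
table_b = [
    ((102,), (116,), 1), ((104,), (118,), 3), ((104,), (120,), 1),
    ((106,), (122,), 2), ((120,), (122,), 0.5), ((120,), (118,), 1),
    ((116, 118), (110,), 1), ((122, 118), (110,), 1),
    ((116, 118), (108,), 2), ((122,), (112, 114), 2),
]
arcs, composition, decomposition = build_digraph(table_b)
```

Applied to Table B, the sketch yields three composition nodes and one decomposition node, which agrees with the counts of nodes 124, 126, 128 and node 130 described for FIG. 2.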

[0032] The constructed digraph of the present invention may be used as the basis for formulating a query whose purpose is to determine the most efficient path between any two nodes or between any two groups of nodes, where efficiency is defined as the lowest cumulative cost of the arcs along a given path. The decision whether to traverse a particular arc may be made by considering the cumulative cost of the arcs traversed and selecting the path having the lowest cumulative cost. A preferred method of efficient path discovery in the digraph of the present invention is now described.

[0033] Reference is now made to FIGS. 3A and 3B, which, taken together, are a simplified flowchart illustration of a method of efficient path discovery in a digraph, operative in accordance with a preferred embodiment of the present invention. In the method of FIGS. 3A and 3B, given a set S of source nodes, a set T of target nodes, and various paths therebetween, a single source node s is introduced into the digraph and connected via outgoing arcs of zero cost to each of the source nodes S. A cumulative cost of the path to s is typically initialized to zero, since s is the node of origin for all source nodes. A single composition node t′ is also defined and connected via incoming arcs of zero cost to each of the target nodes T. A single target node t is likewise defined and connected via an incoming arc of zero cost to the composition node t′.

[0034] A set W of nodes is defined and initially includes only node s. A set V is likewise defined including all nodes in the digraph. The cumulative cost of the path from s to any given node in V is initially unknown, and is, therefore, typically considered to be infinite. The following steps are then performed to find the cumulative cost from s to any node in V:

[0035] 1) The cumulative costs of the paths to all nodes y in V that are connected to node s by an arc are determined by the cost of the arc.

[0036] 2) While W<>V:

[0037] 3) A node x in V is selected among all nodes in V that are not in W whose cumulative cost is minimal. If there is more than one minimal node, then any node x may be selected.

[0038] 4) Node x is added to W.

[0039] 5) For each node y in V to which x has an outgoing arc:

[0040] 6) If y is not a composition node, then the cumulative cost of the path to y is the lesser of a) the current cumulative cost of the path to y, if known, and b) the cumulative cost of the path to x plus the cost of the arc connecting x to y;

[0041] 7) If y is a composition node, and all nodes that have outgoing arcs to y are in W, then the cumulative cost of the path to y is the lesser of a) the current cumulative cost of the path to y, if known, and b) the sum of the cumulative costs of the paths to all nodes that have outgoing arcs to y.
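Steps 1) through 7) might be rendered in Python roughly as follows. This is a sketch under stated assumptions rather than a definitive implementation: the function name `cumulative_costs`, the string labels `"s"`, `"t'"`, and `"t"`, and the labels `c1`, `c2`, `c3`, and `d1` for the composition and decomposition nodes are hypothetical, and the graph is assumed not to contain nodes already named `"s"`, `"t'"`, or `"t"`. The arcs follow Table B, with zero cost on arcs into composition nodes and out of the decomposition node. Nodes 102, 104, and 106 are taken as the sources and nodes 108 and 110 as the targets purely as an illustrative query.

```python
import math

def cumulative_costs(arcs, composition, sources, targets):
    """Return {node: cumulative cost from s}; arcs maps (x, y) to a cost >= 0."""
    arcs = dict(arcs)
    composition = set(composition) | {"t'"}   # t' is itself a composition node
    for n in sources:
        arcs[("s", n)] = 0.0                  # s -> each source, zero cost
    for n in targets:
        arcs[(n, "t'")] = 0.0                 # each target -> t', zero cost
    arcs[("t'", "t")] = 0.0                   # t' -> t, zero cost

    nodes = {n for arc in arcs for n in arc}  # the set V
    preds = {n: [] for n in nodes}
    succs = {n: [] for n in nodes}
    for (x, y), c in arcs.items():
        preds[y].append(x)
        succs[x].append((y, c))

    p = {n: math.inf for n in nodes}          # unknown costs start as infinite
    p["s"] = 0.0
    done = {"s"}                              # the set W
    for y, c in succs["s"]:                   # step 1
        p[y] = c
    while done != nodes:                      # step 2
        x = min(nodes - done, key=p.get)      # step 3 (ties broken arbitrarily)
        done.add(x)                           # step 4
        for y, c in succs[x]:                 # step 5
            if y not in composition:          # step 6
                p[y] = min(p[y], p[x] + c)
            elif all(z in done for z in preds[y]):   # step 7
                p[y] = min(p[y], sum(p[z] for z in preds[y]))
    return p

fig2_arcs = {
    (102, 116): 1, (104, 118): 3, (104, 120): 1, (106, 122): 2,
    (120, 122): 0.5, (120, 118): 1,
    (116, "c1"): 0, (118, "c1"): 0, ("c1", 110): 1,
    (122, "c2"): 0, (118, "c2"): 0, ("c2", 110): 1,
    (116, "c3"): 0, (118, "c3"): 0, ("c3", 108): 2,
    (122, "d1"): 2, ("d1", 112): 0, ("d1", 114): 0,
}
p = cumulative_costs(fig2_arcs, {"c1", "c2", "c3"},
                     sources=[102, 104, 106], targets=[108, 110])
```

In this sketch node 118 ends with a cumulative cost of 2 (reached through node 120 rather than directly), node 110 with 4, node 108 with 5, and node t with 9. Because a composition node sums the costs of all of its predecessors, the cost of reaching a node that feeds several composition nodes, such as node 118 here, is counted once for each of them.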

[0042] The most efficient path from node s to any other destination node, and ultimately to node t, is comprised of the arcs whose cost was added to the final cumulative cost of the destination node. From this, it may be seen that the most efficient path from S to T may be derived as follows:

[0043] a) traverse each incoming arc of each node in T whose cost was added to the final cumulative cost of each current node, to arrive at one or more next nodes in the path;

[0044] b) traverse each incoming arc of each node arrived at in the previous step whose cost was added to the final cumulative cost of each node arrived at in the previous step, to arrive at one or more next nodes in the path;

[0045] c) repeat step b) until the currently-arrived-at nodes are the nodes of S. The traversed arcs together form the most efficient path from S to T.
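Steps a) through c) can be sketched as a backward walk over the final cumulative costs. The following Python is a hedged illustration only: the function name `backtrack`, the small example graph, and its node labels are hypothetical, and the cumulative costs `p` are written out by hand for that example. The sketch assumes that an arc into an ordinary node "was added" when the cost of its tail plus the arc cost equals the node's final cost, that all incoming arcs of a composition node were added, and that when several arcs tie at an ordinary node only the first one found is kept.

```python
import math

def backtrack(arcs, composition, p, sources, targets):
    """Collect the arcs on the most efficient path from sources to targets."""
    incoming = {}
    for (x, y), c in arcs.items():
        incoming.setdefault(y, []).append((x, c))
    path, seen = set(), set()
    stack = list(targets)                      # step a) starts at the nodes of T
    while stack:                               # steps b) and c)
        y = stack.pop()
        if y in seen or y in sources:          # stop once a source node is reached
            continue
        seen.add(y)
        if y in composition:
            chosen = [x for x, _ in incoming.get(y, [])]   # every predecessor
        else:
            tight = [x for x, c in incoming.get(y, [])
                     if math.isclose(p[x] + c, p[y])]
            chosen = tight[:1]                 # assumption: keep one arc on ties
        for x in chosen:
            path.add((x, y))
            stack.append(x)
    return path

# Hypothetical example: sources a and b, composition node c, target z.
arcs = {("a", "x"): 1, ("b", "x"): 5, ("b", "y"): 2,
        ("x", "c"): 0, ("y", "c"): 0, ("c", "z"): 1}
p = {"a": 0, "b": 0, "x": 1, "y": 2, "c": 3, "z": 4}
path = backtrack(arcs, {"c"}, p, sources={"a", "b"}, targets={"z"})
```

In the example the arc from b to x is left out because it did not determine the final cost of x, while both arcs into the composition node c are kept.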

[0046] It may thus be seen that the method of FIGS. 3A and 3B may be used as a method of efficient data transformation when applied to a data transformation table such as Table A hereinabove.

[0047] The method of FIGS. 3A and 3B may be alternatively understood using the following pseudocode:

Input:
    A digraph D=(V,A), with costs Cuv >= 0 on its arcs, and V having >= 0 composition nodes and >= 0 decomposition nodes.
    A set of source nodes S such that each node in S belongs to V.
    A set of target nodes T such that each node in T belongs to V.
Output:
    The lowest-cost paths from S to all nodes in V in an array p.
Begin:
    Construct a new node, s, and add arcs with cost=0 from s to every node in S.
    Construct a new node, t, and a new composition node t′.
    Add arcs with cost=0 from all the nodes in T to the composition node, t′, and from the composition node to t.
    Add all the new nodes and arcs to the digraph D=(V,A).
    Set W := {s}; p[s] := 0;
    for all y in V other than s do
        p[y] := infinity;
    for all y such that y is a node in V with an incoming arc from s do
        p[y] := Csy;
    while W <> V do
    begin
        find min {p[y] : y is not in W}, say p[x];
        set W := W union {x};
        for all y in V such that there is an arc from x to y do
        begin
            if y is not a composition node then
                p[y] := min {p[y], p[x] + Cxy}
            else if all nodes that have outgoing arcs to y are in W then
                p[y] := min {p[y], sum of all p[z] where z has an outgoing arc to y}
        end
    end
end

[0048] Thus, by applying the method of FIGS. 3A and 3B, the most efficient path from s to t is determined as those arcs whose cost was added to the final cumulative cost of t. From this, the most efficient path from S to T may be derived as described hereinabove.

[0049] Reference is now made to FIGS. 4A, 4B, 4C, and 4D which show the digraph of FIG. 2 reflecting the application of the method of FIGS. 3A and 3B to an exemplary query in which the most efficient path is determined for the transformation of the set of source nodes 102, 104, and 106 into the set of target nodes 108 and 110. FIG. 4A shows the digraph of FIG. 2 for which s, t′, and t have been defined. In FIG. 4B the arcs shown in dashed lines represent those arcs that lie along the most efficient paths from s to any node in V, and in particular to t. In FIG. 4C the arcs shown in dashed lines represent those arcs that do not lie along any efficient path and are the complementary arcs to those in FIG. 4B. Finally, in FIG. 4D the arcs shown in dashed lines represent those arcs that lie along the most efficient paths from source nodes 102, 104, and 106 to target nodes 108 and 110, being a subset of the arcs shown in FIG. 4B.

[0050] It is appreciated that one or more of the steps of any of the methods described herein may be omitted or carried out in a different order than that shown, without departing from the true spirit and scope of the invention.

[0051] While the methods and apparatus disclosed herein may or may not have been described with reference to specific hardware or software, it is appreciated that the methods and apparatus described herein may be readily implemented in hardware or software using conventional techniques.

[0052] While the present invention has been described with reference to one or more specific embodiments, the description is intended to be illustrative of the invention as a whole and is not to be construed as limiting the invention to the embodiments shown. It is appreciated that various modifications may occur to those skilled in the art that, while not specifically shown herein, are nevertheless within the true spirit and scope of the invention.

What is claimed is:
 1. A digraph comprising: a plurality of ordinary nodes; at least one of: a composition node and a decomposition node; and a plurality of arcs interconnecting any of said nodes.
 2. A digraph according to claim 1 wherein said composition node is connected to at least two of said nodes via arcs incoming to said composition node and to one other of said nodes via an arc outgoing to said other node.
 3. A digraph according to claim 1 wherein said decomposition node is connected to one of said nodes via an arc incoming to said decomposition node and to at least two other of said nodes via arcs outgoing to said other nodes.
 4. A digraph according to claim 1 wherein said ordinary nodes represent data formats.
 5. A digraph according to claim 4 wherein a first one of said ordinary nodes connected via an outgoing one of said arcs to a second one of said ordinary nodes represents a transformation of one data format into another.
 6. A digraph according to claim 1 wherein any of said arcs has an associated non-negative cost.
 7. A method for constructing a digraph from a plurality of source-to-target traversals, the method comprising: representing said sources and targets as a plurality of ordinary nodes; representing any of said traversals having a one-to-one cardinality by connecting the source node of said traversal to the target node of said traversal by an arc outgoing from said source node; and performing any of the following: representing any of said traversals having a many-to-one cardinality by connecting the source nodes of said traversal to a composition node by arcs outgoing from said source nodes, and by connecting said composition node to the target node of said traversal by an arc outgoing from said composition node; representing any of said traversals having a one-to-many cardinality by connecting the source node of said traversal to a decomposition node by an arc outgoing from said source node, and by connecting said decomposition node to the target nodes of said traversal by arcs outgoing from said decomposition node; and representing any of said traversals having a many-to-many cardinality by connecting the source nodes of said traversal to a composition node by arcs outgoing from said source nodes, by connecting said composition node to a decomposition node by an arc outgoing from said composition node, and by connecting said decomposition node to the target nodes of said traversal by arcs outgoing from said decomposition node.
 8. A method according to claim 7 and further comprising associating a non-negative cost with any of said arcs.
 9. A method of efficient path discovery in a digraph including a plurality of ordinary nodes, at least one of a composition node and a decomposition node, and a plurality of arcs interconnecting any of said nodes, the method comprising: providing a source node s connected to a set S of source nodes in said digraph via outgoing arcs of zero cost; initializing to zero a cumulative cost of the path to s; providing a composition node t′ connected to a set T of target nodes in said digraph via incoming arcs of zero cost; providing a target node t connected to composition node t′ via an incoming arc of zero cost; defining a set W of nodes in said digraph initially including only node s; defining a set V of all nodes in said digraph; determining the cumulative costs of the paths to all nodes y in V that are connected to node s by an arc; while W<>V: selecting a node x in V from all nodes in V that are not in W whose cumulative cost is minimal; adding node x to W; and for each node y in V to which x has an outgoing arc: if y is not a composition node, determining the cumulative cost of the path to y as the lesser of a) the current known cumulative cost of the path to y, and b) the cumulative cost of the path to x plus the cost of the arc connecting x to y; if y is a composition node, and all nodes that have outgoing arcs to y are in W, determining the cumulative cost of the path to y as the lesser of a) the current known cumulative cost of the path to y, and b) the sum of the cumulative costs of the paths to all nodes that have outgoing arcs to y.
 10. A method according to claim 9 and further comprising determining the most efficient path from node s to a destination node selected from any of said nodes as comprising the arcs whose cost was added to the final cumulative cost of said destination node.
 11. A method according to claim 9 and further comprising determining the most efficient path from S to T as comprising the arcs whose cost was added to the final cumulative costs of the nodes of T.
 12. A method according to claim 11 wherein the step of determining the most efficient path comprises: a) traversing each incoming arc of each node in T whose cost was added to the final cumulative cost of each current node, to arrive at one or more next nodes in the path; b) traversing each incoming arc of each node arrived at in the previous step whose cost was added to the final cumulative cost of each node arrived at in the previous step, to arrive at one or more next nodes in the path; and c) repeating step b) until the currently-arrived-at nodes are the nodes of S, wherein the traversed arcs together form the most efficient path from S to T.
 13. A computer program embodied on a computer-readable medium, the computer program comprising: a first code segment operative to provide a source node s connected to a set S of source nodes in a digraph via outgoing arcs of zero cost, said digraph including a plurality of ordinary nodes, at least one of a composition node and a decomposition node, and a plurality of arcs interconnecting any of said nodes; a second code segment operative to initialize to zero a cumulative cost of the path to s; a third code segment operative to provide a composition node t′ connected to a set T of target nodes in said digraph via incoming arcs of zero cost; a fourth code segment operative to provide a target node t connected to composition node t′ via an incoming arc of zero cost; a fifth code segment operative to define a set W of nodes in said digraph initially including only node s; a sixth code segment operative to define a set V of all nodes in said digraph; a seventh code segment operative to determine the cumulative costs of the paths to all nodes y in V that are connected to node s by an arc; an eighth code segment operative, while W<>V, to: select a node x in V from all nodes in V that are not in W whose cumulative cost is minimal; add node x to W; and for each node y in V to which x has an outgoing arc: if y is not a composition node, determine the cumulative cost of the path to y as the lesser of a) the current known cumulative cost of the path to y, and b) the cumulative cost of the path to x plus the cost of the arc connecting x to y; if y is a composition node, and all nodes that have outgoing arcs to y are in W, determine the cumulative cost of the path to y as the lesser of a) the current known cumulative cost of the path to y, and b) the sum of the cumulative costs of the paths to all nodes that have outgoing arcs to y.